Unit 3 - Transforming DataFrames and Series


In [ ]:
import pandas as pd

In [ ]:
def make_data(n_rows=None):
    columns = ['room_id', 'host_id', 'room_type', 'neighborhood', 'reviews', 'overall_satisfaction',
               'accommodates', 'bedrooms', 'price']
    df = pd.read_csv('../data/airbnb_input.csv', usecols=columns, index_col='room_id').sort_index()
    if n_rows:
        df = df.head(n_rows)
    
    return df


def make_location_data():
    columns = ['room_id', 'latitude', 'longitude']
    df = pd.read_csv('../data/airbnb_input.csv', usecols=columns, index_col='room_id').sort_index()
    return df

df = make_data(5)

df is the same DataFrame that we used in Unit 2. You can get it by calling function make_data.

Dropping rows and columns

In order to drop rows and columns from a DataFrame, you can use funtion drop.

Drop has two important arguments:

  • inplace: with this argument, you can chose if you want to transform the original DataFrame or if you want the drop function to return a copy of the transformed DataFrame. It's default value is False, i.e, you don't apply the transformation in the original DataFrame. You'll see this argument in many functions that transform DataFrames.
  • axis: with this argument, you chose if you want to drop rows (axis=0) or if you want to drop columns (axis=1). The default behaviour is to drop rows. You'll se this argument in many functions that transform DataFrames.

Examples...

Dropping row with index 6499 and checking that the original DataFrame df didn't change:


In [ ]:
df1 = df.drop(6499)
print(6499 in df.index)
df1

While with inplace=True, we do change the original DataFrame:


In [ ]:
df.drop(6499, inplace=True)
print(6499 in df.index)
df

Now, go ahead and try to drop a list of rows!


In [ ]:
# code to drop a list of rows

In order to drop columns, we use the axis=1 argument:


In [ ]:
df = df.drop(['reviews', 'price'], axis=1)
df

Copying DataFrames

Sometimes it's useful to keep an unchanged version of your DataFrame.

When doing that, be sure not to fall on this mistake:


In [ ]:
df = make_data(n_rows=5)
new_df = df # new_df is just another way to refer to the same DataFrame as df
df.drop(['room_type', 'neighborhood', 'reviews'], axis=1, inplace=True)
new_df

In this example, we are just declaring a new way to refer to the original DataFrame. So, both the transformations that you apply on df and on new_df will impact the same DataFrame.

If you want to make a copy of a DataFrame, you should use the function copy:


In [ ]:
df = make_data(n_rows=5)
new_df = df.copy() # new_df now refers to a copy of df, we actually have two DataFrames
df.drop(['room_type', 'neighborhood', 'reviews'], axis=1, inplace=True)
new_df

Basic math operations

Now, we'll see examples on how to apply arithmetic operations (add, subtract, multiply, divide) between:

  • a constant and a DataFrame column
  • two DataFrame columns

Multiplying a DataFrame column by a constant

Getting the rooms' price per week (7 nights):


In [ ]:
df = make_data(5)
df['price_per_week'] = df.price.multiply(7) # or df['price_per_week'] = df.price * 7
df.head()

Dividing a DataFrame column by another column

Getting the people per bedroom ratio:


In [ ]:
df = make_data(5)
df['people_per_bedroom'] = df.accommodates.divide(df.bedrooms)
# or df['people_per_bedroom'] = df.accommodates / df.bedrooms
df.head()

String operations

In Pandas Series and Indexes have a set of string processing methods that can be accessed through the str attribute.

For instance, if we want to make neighborhood names more machine friendly, we can make the neighborhood names all lower case and replace spaces with underscores:


In [ ]:
df.neighborhood = df.neighborhood.str.replace(' ', '_').str.lower()
df.head()

We can also want to remove some pattern from the strings in a certain column:


In [ ]:
df.room_type = df.room_type.str.strip('/apt')
df.head()

To give an example using the str attribute on Indexes, we'll convert the room_id index into a string and then append it the host_id separated by an underscore:


In [ ]:
df = make_data(5)
df = df.reset_index()
df.room_id = df.room_id.astype(str)
df = df.set_index('room_id')
df.index = df.index.str.cat(df.host_id.astype(str), sep='_')
df

Group by

In Pandas, group by refers two a process of three chained steps: split-apply-combine.

  • split: splitting the DataFrame into groups
  • apply: apply a function to each group (aggregation, transformation and filtration)
  • combine: create a DataFrame with the results

For instance, if we want to know how many rooms does each landlord have:


In [ ]:
df.groupby('host_id')

In [ ]:
df = make_data()
# reset df index so that we have the room_id as a column
df = df.reset_index()
# only selecting columns room_id and host_id from the DataFrame
df = df[['room_id', 'host_id']]
# groupby() returns a DataFrameGroupBy object and count()
# is the aggregate function that counts room_ids in each group
df = df.groupby('host_id').count()
# now we don't have room_id anymore, we have room counts instead
# so the column name should be changed
df = df.rename(columns={'room_id': 'room_counts'})
df.head()

By default, the columns on which we perform the group by, become indexes. If we don't want this behaviour, we can use the argument as_index=False.


In [ ]:
df = make_data()
df = df.reset_index()
df = df[['room_id', 'host_id']]
df = df.groupby('host_id', as_index=False).count()
df = df.rename(columns={'room_id': 'room_counts'})
df.head(10).tail(5)

We can use group by to learn the average price per room type, for each landlord.


In [ ]:
df = make_data()
df = df[['host_id', 'room_type', 'price']]
df = df.groupby(['host_id', 'room_type']).mean()
df.head(10).tail(5)

Now a small challenge! Check the split-apply-combine documentation and try to find examples, with our dataset, for:

  • a group by with a transformation on the apply step
  • a group by with a filtration on the apply step

In [ ]:
# write the code for the challenge here

Bonus content - Merge and concatenate DataFrames

If you have the time, after the practical exercises, explore the merge and concat functions.

In particular, try to solve these exercises:

  • Use function make_location_data to get the locations for each room.
  • Merge df with the locations DataFrame so that you have two new columns (latitude and longitude) on df.
  • Obtain the same result as the last question by using the concat function instead